Predicting which patients with cancer will see a psychiatrist or counsellor from their initial oncology consultation document using natural language processing

Background Patients with cancer often have unmet psychosocial needs. Early detection of who requires referral to a counsellor or psychiatrist may improve their care. This work used natural language processing to predict which patients will see a counsellor or psychiatrist from a patient’s initial oncology consultation document. We believe this is the first use of artificial intelligence to predict psychiatric outcomes from non-psychiatric medical documents. Methods This retrospective prognostic study used data from 47,625 patients at BC Cancer. We analyzed initial oncology consultation documents using traditional and neural language models to predict whether patients would see a counsellor or psychiatrist in the 12 months following their initial oncology consultation. Results Here, we show our best models achieved a balanced accuracy (receiver-operating-characteristic area-under-curve) of 73.1% (0.824) for predicting seeing a psychiatrist, and 71.0% (0.784) for seeing a counsellor. Different words and phrases are important for predicting each outcome. Conclusion These results suggest natural language processing can be used to predict psychosocial needs of patients with cancer from their initial oncology consultation document. Future research could extend this work to predict the psychosocial needs of medical patients in other settings.


Table of Contents
Performance of all models when predicting if patients will see a psychiatrist in twelve months, with extended metrics
Table ST3. Statistical comparison of models when predicting if patients will see a psychiatrist in the twelve months following their initial oncologist consultation
Table ST4. Performance of all models when predicting if patients will see a counsellor in twelve months, with extended metrics
Table ST5. Statistical comparison of models when predicting if patients will see a counsellor in the twelve months following their initial oncologist consultation

Note SN1: Supplementary Methodological Note
Our methods are based upon prior work using this dataset and these neural models 1, with new tuning. We also conducted multiple evaluations to estimate variance, given the increased variance associated with the class imbalance of the seeing-a-psychiatrist (SaP) target; the seeing-a-counsellor (SaC) target was not as imbalanced.

Obtaining Data
As in this prior study, unstructured text documents were provided as extracted Microsoft Word documents from BC Cancer electronic health records, along with some structured data including diagnosis date, age, and cancer site at diagnosis. Additional structured data, specifically death dates, were obtained from BC Vital Statistics by BC Cancer data stewards and linked to our dataset. We also received metadata on extracted documents, including the medical speciality that generated the document and the document type. We used this metadata in our document selection, excluding documents that were not consultation documents, such as progress notes.

Text Processing
We used the same processing as in this prior work: we replaced question and exclamation marks with periods, collapsed runs of spaces into single spaces, and replaced characters that were not alphanumeric, parentheses, apostrophes, or punctuation with spaces. We removed automatically added text at the beginning and end of the documents that contained information such as identifying information, dates, and the names of the dictating providers and who they sent the documents to; these string patterns are available in the project's GitHub repository.
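The normalization steps above can be sketched as follows. This is a minimal illustration only, not the project's code: the exact set of punctuation that is kept, the order of the steps, and the final strip of leading and trailing whitespace are assumptions, and the removal of automatically added header and footer text is not shown because those patterns live in the project's repository.

```python
import re


def normalize_text(text: str) -> str:
    """Apply the three normalization steps described in the text."""
    # 1. Question and exclamation marks become periods.
    text = re.sub(r"[?!]", ".", text)
    # 2. Characters that are not alphanumeric, whitespace, parentheses,
    #    apostrophes, or kept punctuation become spaces.
    #    The kept punctuation set (. , ; : -) is an assumption.
    text = re.sub(r"[^A-Za-z0-9\s()'.,;:\-]", " ", text)
    # 3. Runs of whitespace collapse to a single space. Doing this last
    #    also cleans up spaces introduced by step 2 (ordering assumed).
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Any pain?  None!  CA-125 = 35* (normal)"))
```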

Language Models Used
Bag-of-words (BoW) models have a simple understanding of documents, counting how frequently certain words occur in a document 2. These frequencies then form a vector, which can be used by traditional machine learning algorithms. We again implemented BoW with common choices: L2-regularized logistic regression with the lbfgs solver, and term frequency-inverse document frequency weighting 3. We used a vector length of 5000, as smaller lengths decreased performance, while larger lengths did not seem to improve performance and would lead to a high ratio of features to samples. To tune the hyperparameter C, corresponding to the inverse of the lambda regularization factor, we tried values between 0.1 and 5. For SaP, we used a C of 0.6. For SaC, we used a C of 1.05. Training and evaluating a final BoW model took around 10 minutes.
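To make the weighting concrete, the sketch below computes term frequency-inverse document frequency vectors with a capped vocabulary using only the Python standard library. It is an illustration under stated assumptions, not the study's implementation (which used scikit-learn): whitespace tokenization, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, and L2 normalization are assumed here, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_features=5000):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenization
    # Number of documents each term appears in.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Cap the vocabulary at the most frequent terms across the corpus.
    totals = Counter(tok for toks in tokenized for tok in toks)
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    n_docs = len(docs)
    # Smoothed inverse document frequency (assumed formula).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["pain anxiety pain", "pain nausea"])
print(vocab)
```

The resulting vectors would then be passed to a classifier such as L2-regularized logistic regression, where a smaller C means stronger regularization.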
When used in natural language processing (NLP), convolutional neural networks (CNNs) use convolutions over a small number of adjacent words 2. They then understand a document based on combinations of these small groupings, allowing a reduction of the feature space while still considering distant word relationships. We based our CNN models on those developed for general and medical document classification [4][5][6]. We used the Adam optimizer for CNN. We tried hyperparameters from these works and additional, often nearby, values. Trying different word vector lengths, window lengths, and numbers of output channels did not seem to benefit performance, so we used these hyperparameters as in prior work. We investigated dropout values of 0.5, 0.55, 0.6, 0.65, 0.7, 0.8, 0.825, 0.85, 0.875, 0.9, and 0.95, weight decays of 0, 0.1, 0.01, 0.001, and 0.0001, and learning rates of 0.0001 and 0.00005. We used grid search to investigate many combinations of these values, also manually testing some combinations with values near promising combinations. Our final hyperparameter set used 300-length word vectors, window lengths of 3, 4, and 5 tokens, and 500 output channels for both targets. For SaP, we used a weight decay of 0.0001, a dropout rate of 0.85, and a learning rate of 0.0001. For SaC, we used a weight decay of 0, a dropout rate of 0.85, and a learning rate of 0.00005. We trained models with up to 100 epochs, with early stopping after 5 epochs (patience) of no improvement in balanced accuracy. We used a batch size of 16, the maximum supported on our hardware.
Long short-term memory (LSTM) models are a type of recurrent neural network that understands a document one word at a time, changing the prediction at every step 2. LSTMs have memory cells that allow the model to better consider what words occurred in other parts of the document. We based our LSTM on previous models developed using regularization to avoid overfitting 7. We used the Adam optimizer with LSTM. We used a bidirectional LSTM, where there is both a forward and a backward LSTM layer, which allows each input token to be understood based on both the tokens before and after it. We again tried values from prior work along with additional hyperparameter values. Initial investigation revealed no benefit from varying embedding lengths or hidden unit dimensions, so we used values from prior work, including word embeddings with length 300 and a hidden unit dimension of 512. We investigated dropout rates including 0.1, 0.2, and 0.3, word embedding dropouts of 0.1 and 0.001, weight dropouts of 0, 0.01, and 0.001, and learning rates of 0.001, 0.0001, and 0.0005. We again used grid search to try many of these combinations. For SaP, we used a dropout rate of 0.1, a word embedding dropout of 0.01, a weight dropout of 0.01, and a learning rate of 0.0001. For SaC, we used a dropout rate of 0.2, a word embedding dropout of 0.1, a weight dropout of 0.01, and a learning rate of 0.0005. We again trained models with up to 100 epochs, with early stopping after 5 epochs (patience) of no improvement in balanced accuracy. We used a batch size of 16, the maximum supported on our hardware. We again shuffled training data by randomly shuffling indices within our custom batch sampler. Training and evaluating an LSTM model took around 4 hours.
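The early stopping rule used for the neural models (up to 100 epochs, stopping after 5 epochs without improvement in dev set balanced accuracy) can be illustrated with the sketch below. It replays a list of per-epoch dev scores rather than training a model; the function name and the example scores are hypothetical, and in practice a framework callback would likely handle this logic.

```python
def run_early_stopping(dev_scores, max_epochs=100, patience=5):
    """Replay per-epoch dev scores; return (best_epoch, best_score, epochs_run)."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, score in enumerate(dev_scores[:max_epochs]):
        epochs_run = epoch + 1
        if score > best_score:
            # A strict improvement resets the patience counter.
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: no improvement for `patience` epochs
    return best_epoch, best_score, epochs_run


# Hypothetical balanced accuracies on the dev set, one per epoch.
scores = [0.60, 0.65, 0.64, 0.66, 0.65, 0.65, 0.64, 0.66, 0.65, 0.70]
print(run_early_stopping(scores))
```

In this example training stops before the final epoch is reached, which shows the trade-off behind patience: a low value saves time but can miss late improvements.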
The bidirectional encoder representations from transformers (BERT) model was developed by Devlin et al 8 to allow a deep, bidirectional understanding of language. Using state-of-the-art computational resources, they developed a transformer model which allows all pieces of text in a document to be considered at once. This model uses attention to focus on useful relationships. This model can then be fine-tuned to accomplish other tasks, such as the binary classification in this work. While allowing all words to be considered with respect to how they relate to each other, these models are limited to 512 tokens, each of which represents all or part of a word. We used an AdamW optimizer with BERT. During initial investigation, when there were some differences in our dataset processing and target generation, we tested different English BERT models including bert_large_uncased, bert_pretrain_output_all_notes_150000 and bert_pretrain_output_disch_100000 from ClinicalBERT 9, and biobert_pretrain_output_all_notes_150000 and biobert_pretrain_output_disch_100000 from BioBert 10. We seemed to achieve the best performance with bert_base_uncased, so we proceeded with this language model. We conducted hyperparameter tuning, investigating weight decays of 0, 0.1, 0.01, 0.0001, and 0.00001 and learning rates of 0.001, 0.005, 0.0001, 0.0005, 0.00005, and 0.00001, again using grid search for most combinations. For SaP, we used a weight decay of 0 and a learning rate of 0.001; for SaC, we used a weight decay of 0.01 and a learning rate of 0.00005. We again trained models with up to 100 epochs, with early stopping after 5 epochs (patience) of no improvement in balanced accuracy. We used a batch size of 8, the maximum supported on our hardware. We shuffled training data by using the shuffle=True flag in the DataLoader. Training and evaluating a BERT model took around 8-12 hours.
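As an illustration of the grid search described for BERT, the sketch below enumerates every combination of the listed weight decay and learning rate values. The helper names `grid` and `best_config` and the toy scoring function are hypothetical; a real run would train a model per configuration and score it by dev set balanced accuracy.

```python
from itertools import product

# Values listed in the text for BERT tuning.
weight_decays = [0, 0.1, 0.01, 0.0001, 0.00001]
learning_rates = [0.001, 0.005, 0.0001, 0.0005, 0.00005, 0.00001]


def grid(**axes):
    """Return one dict per combination of the supplied hyperparameter values."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]


def best_config(configs, evaluate):
    """Pick the configuration with the highest score from `evaluate`."""
    return max(configs, key=evaluate)


configs = grid(weight_decay=weight_decays, learning_rate=learning_rates)
print(len(configs))  # number of combinations in the full grid


# Toy stand-in for "train, then measure dev balanced accuracy" (assumption).
def toy_score(cfg):
    return -abs(cfg["learning_rate"] - 0.0001) - cfg["weight_decay"]


print(best_config(configs, toy_score))
```

At several hours per BERT run, a full grid of this size is expensive, which is consistent with the text's note that grid search covered most rather than all combinations.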
The Longformer model was developed by Beltagy et al 11 as an extension of BERT that can handle larger documents of up to 4096 tokens. Whereas BERT's memory requirements scale quadratically with token length, Longformer's scale linearly, allowing this higher limit. It achieves this by replacing the dense attention mechanism in BERT with three more selective attention mechanisms: a sliding window attending to tokens within a set distance of a token, a dilated window allowing attention to tokens farther away, and limited global attention, which allows dense attention between only a set number of tokens. We used the pretrained model allenai-longformer-base-4096, a learning rate of 0.0001, a weight decay of 0, and a patience of 10. We again shuffled training data by using the shuffle=True flag in the DataLoader. Due to our limited VRAM, we were only able to run a batch size of 1 when using 4096 tokens (batch size 2 for 2048 tokens, 4 for 1024 tokens, and 8 for 512 tokens), which led to epochs taking almost 24 hours. As such, we used undersampling for training Longformer, which resulted in epochs taking less than an hour, and an entire run of training and evaluation taking around 12-24 hours.
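The difference between quadratic and linear scaling can be seen by counting how many token pairs each attention pattern touches. The sketch below is a simplified illustration that models only the sliding window (not dilation or global attention), and the window size of 512 is an assumed value rather than one reported in this work.

```python
def dense_attention_pairs(n_tokens):
    """Every token attends to every token: grows quadratically."""
    return n_tokens * n_tokens


def sliding_window_pairs(n_tokens, window):
    """Each token attends to itself and to tokens within window // 2
    positions on either side: grows roughly linearly in n_tokens."""
    half = window // 2
    total = 0
    for i in range(n_tokens):
        lo = max(0, i - half)
        hi = min(n_tokens - 1, i + half)
        total += hi - lo + 1
    return total


for n in (512, 1024, 2048, 4096):
    print(n, dense_attention_pairs(n), sliding_window_pairs(n, window=512))
```

Doubling the document length quadruples the dense count but only about doubles the sliding-window count, which is why the longer token limit becomes feasible.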

Investigating the Impact of Number of Tokens on Performance
To investigate how the number of tokens available to a model would impact performance, we evaluated Longformer using a maximum of 512, 1024, 2048, and 4096 tokens. As above, due to technical constraints we needed to use Longformer with undersampling, which we implemented by keeping twice as many documents from those who did not see a psychiatrist or counsellor as from those who did. Exploring this using only the training and dev sets, we found that a higher patience was required, so we set this to 10. Occasionally, models would not train sufficiently to ever see an improved performance on the dev set. In this situation, we set the models to exit and not evaluate on the training set. We compared the Longformer to BERT and CNN models also using undersampling. Besides patience being set to 10, we did not change the CNN hyperparameters, as their performance on the development set was comparable to when using loss weighting. BERT did not train successfully (that is, it did not achieve dev set performance above 0.5) when using undersampling and the hyperparameters above. Instead, we used one of the next best performing hyperparameter combinations from the original tuning, which also happened to be the hyperparameters tuned for Longformer: a learning rate of 0.0001 and a weight decay of 0.
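The 2:1 undersampling described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the fixed seed, and the choice to sample negatives uniformly at random are assumptions.

```python
import random


def undersample(labels, negatives_per_positive=2, seed=42):
    """Return shuffled indices keeping all positives and a random subset
    of negatives, at most `negatives_per_positive` per positive."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), negatives_per_positive * len(positives))
    kept = positives + rng.sample(negatives, n_keep)
    rng.shuffle(kept)
    return kept


# Hypothetical imbalanced training labels: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90
indices = undersample(labels)
print(len(indices), sum(labels[i] for i in indices))
```

Because most negatives are discarded, each epoch sees far fewer documents, which is what makes the shorter epoch times possible.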

Hardware Used
As in prior work, we conducted all computation for this work on a virtual installation of Windows Server 2012 R2, with an eight-processor Intel Xeon 8160 CPU and 16 GB of RAM. We had access to a shared GPU through an NVIDIA GRID V100D-16Q, with 16 GB of VRAM allocated to our virtualisation. We ran the neural models on the virtual GPU, and BoW models on the CPU.

Implementation
As done in the prior work, we implemented our BoW model with the scikit-learn library 12, while we used PyTorch 13 and PyTorch Lightning 14 for our neural models, the last being used to reduce boilerplate code. We used the Pandas data processing library 15 for data processing, target generation, and analysis. We used the Captum interpretability library to implement integrated gradients 16.
We used loss weighting to account for the class imbalance of our targets, as most patients did not see a psychiatrist or counsellor in the first twelve months. We adjusted our binary cross entropy loss by the inverse of the relative class proportion.
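One common way to realize this kind of loss weighting is to scale the positive-class term of the binary cross entropy by the ratio of negatives to positives. The sketch below shows that idea in plain Python; whether the study applied the weight in exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def positive_class_weight(labels):
    """Ratio of negative to positive examples (assumed weighting scheme)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def weighted_bce(probabilities, labels, pos_weight):
    """Mean binary cross entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical batch: one positive among four documents.
labels = [1, 0, 0, 0]
weight = positive_class_weight(labels)
print(weight, weighted_bce([0.5, 0.5, 0.5, 0.5], labels, weight))
```

With this weight, the single positive example contributes as much to the loss as the three negatives combined, so the model is not rewarded for always predicting the majority class.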

Novel Multiple Document Neural Interpretation Technique
To interpret a CNN model that predicted seeing a psychiatrist, and a model predicting seeing a counsellor, we first extracted sentences from the documents in the test set. We extracted sentences that had a mean positive attribution score of at least 0.01 after being scored using Captum's layered integrated gradients. During this process, documents were truncated to a maximum of 1500 tokens due to computational constraints. We then provided the sentences to a BERTopic topic model 17. We generally used the parameters suggested as best practice by the author as of August 28, 2023. We did not specify a sentence transformer, so the default all-MiniLM-L6-v2 model was used. We used UMAP 18 with n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', and fixed random_state to 42 to keep results consistent. We used HDBSCAN 19 with min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True. We used a scikit-learn 12 CountVectorizer with stop_words='english', min_df=2, ngram_range=(1, 2). We used nr_topics=21 so that we would have 20 topics, discarding the "-1" topic, which represents outliers. For representations, we added KeyBERT with default parameters 20, and used OpenAI's ChatGPT 21. Using ChatGPT sends the top four representative sentences for each topic to OpenAI; we manually checked these sentences to ensure they did not contain any patient information.
While OpenAI states they do not store data sent through their API, our code requires an extra flag and check before OpenAI is used, to ensure there is no accidental data leakage. Please see the README for more details.
We used ChatGPT with model='gpt-3.5-turbo', exponential_backoff=True, chat=True, and we used the following prompt:
"""
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
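The sentence extraction step that feeds the topic model can be sketched as below. This is an illustration under assumptions, not the project's code: it reads "mean positive attribution score of at least 0.01" as the mean token attribution within a sentence being at least 0.01, splits sentences on period tokens, and drops any trailing fragment left by truncation. The function name and example attributions are hypothetical.

```python
def select_sentences(tokens, attributions, max_tokens=1500, threshold=0.01):
    """Return sentences whose mean token attribution is >= threshold."""
    # Truncate the document, as done for computational reasons in the text.
    tokens = tokens[:max_tokens]
    attributions = attributions[:max_tokens]
    selected, current_tokens, current_scores = [], [], []
    for token, score in zip(tokens, attributions):
        current_tokens.append(token)
        current_scores.append(score)
        if token == ".":  # assumed sentence boundary
            if sum(current_scores) / len(current_scores) >= threshold:
                selected.append(" ".join(current_tokens))
            current_tokens, current_scores = [], []
    # A trailing fragment without a closing period is discarded (assumption).
    return selected


# Hypothetical tokens and integrated-gradients attributions.
tokens = "she reports low mood . her lungs were clear .".split()
scores = [0.05, 0.02, -0.01, 0.04, 0.0, 0.0, -0.02, 0.01, 0.0, 0.0]
print(select_sentences(tokens, scores))
```

The selected sentences, pooled across all test set documents, would then be the input documents for the topic model.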

Note SN2. Visualizing word importance of CNN models used to predict which patients see a counsellor or psychiatrist in the 12 months following initial oncologist consultation document generation
We show a visualization of word importance using integrated gradients for convolutional neural network models predicting whether patients will see a psychiatrist or counsellor in the 12 months following their initial oncologist consultation document being generated. We show a synthesized document created to have similar word importance to a document from a patient that saw both a counsellor and psychiatrist. The darker the green background of a token, the more it predicted seeing the provider in this context, while the redder, the more it was a negative predictor. These are the default Captum colors. The models predict that this patient will see both providers. CNN, convolutional neural network.

Seeing a psychiatrist
This single 41 year old woman was referred to us today for adjunctive chemotherapy due to a large pelvic mass. History of Presenting Illness. Her history started only a few months ago. She had ultrasound done in the context of reporting irregular menstrual cycles and some occasional spotting to her nurse practitioner. Ultrasound last month showed a right adnexal lesions now 8. 1 x 8 .0 x 9 .5 with multiple peripheral nodules. The largest of these peripheral nodules were 4.8 cm. cal5 3 was slightly elevated, other markers were normal. Today in the clinic, she is reporting pain in the lower abdomen, localizing to the right and sometimes making it hard to walk; she is not sure when this started, but suspects it has been at least a few weeks. She has also noticed increasing lose stools over this time, as well as increased flatulence and some urinary frequency. Fortunately, her appetite is intact, and she has not lost any weight. Gynecological History her menarche was at age 12. She reports that her menstrual cycles are usually quite heavy, and typically last 28 days. She has no history of sexual transmitted infections, and she is currently sexually active. Her last mammogram was about six months ago. Past medical and surgical history she is otherwise health, and reports no prior operations. Medications she takes none. allergies no known to drugs. Family History her mother has no history of cancer, though her maternal grandmother had breast at age 42. A maternal uncle had leukemia as a child. Personal history she grew up here in Vancouver and is currently in a relationship, but they lives alone. She denies tobacco or alcohol use, and uses cannabis about once per week. She is currently able to work at her job in retail. Physical Exam She appears younger than her stated age. Her weight was 164 cm, with a weight of 62 kg, and vital signs were normal, with a blood pressure of 108 64 and a heart rate of 71 beats per minute. Her lungs were clear, and I could not find any lymphadenopathy.

Seeing a counsellor
[Figure panel: the same synthesized document shown under "Seeing a psychiatrist" above, with tokens highlighted by their importance for the seeing-a-counsellor model.]

Note SN2. Visualizing word importance of CNN models used to predict which patients see a counsellor or psychiatrist in the 12 months following initial oncologist consultation document generation, adapted for colour blindness
We show a visualization of word importance using integrated gradients for convolutional neural network models predicting whether patients will see a psychiatrist or counsellor in the 12 months following their initial oncologist consultation document being generated. We show a synthesized document created to have similar word importance to a document from a patient that saw both a counsellor and psychiatrist. The darker the green background of a token, the more it predicted seeing the provider in this context, while the more purple, the more it was a negative predictor. The models predict that this patient will see both providers. CNN, convolutional neural network.

Seeing a psychiatrist
[Figure panel: the same synthesized document as in the first visualization, with tokens highlighted in green and purple by their importance for the seeing-a-psychiatrist model.]

Seeing a counsellor
[Figure panel: the same synthesized document as in the first visualization, with tokens highlighted in green and purple by their importance for the seeing-a-counsellor model.]

Values less than 0.000001 are marked 0. We also show Cohen's d effect sizes. Abbreviations: AUC: receiver-operator-curve area-under-curve, BAC: balanced accuracy, BERT: bidirectional encoder representations from transformers, CNN: convolutional neural networks, LSTM: long short-term memory.

Table ST1: Definition of evaluation metrics reported in this work
TP, true positive; TN, true negative; FP, false positive; FN, false negative; AUC, receiver-operator curve area-under-curve.
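For readers without the table at hand, the sketch below computes balanced accuracy and related quantities from the four confusion-matrix counts abbreviated above. These are the standard textbook definitions and are assumed, not confirmed, to match the table; the counts in the example are hypothetical.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (tn + fp)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for an imbalanced target: 50 positives, 950 negatives.
tp, fn, tn, fp = 40, 10, 570, 380
print(balanced_accuracy(tp, tn, fp, fn), accuracy(tp, tn, fp, fn))
```

Balanced accuracy weights both classes equally, so it is less dominated by the majority class than plain accuracy when the target is imbalanced.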

Table ST3. Statistical comparison of models when predicting if patients will see a psychiatrist in the twelve months following their initial oncologist consultation
P-values from running two-tailed dependent t-tests between ten runs of each model are shaded light grey, in the top-right of each grid. Cohen's d values are shown in the bottom-left of each grid. For the t-tests, using Bonferroni correction, we adjust alpha at 95% confidence to 0.05/6. AUC: receiver-operator-curve area-under-curve, BAC: balanced accuracy, BERT: bidirectional encoder representations from transformers, CNN: convolutional neural networks, LSTM: long short-term memory.
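The quantities in these comparison tables can be sketched with the standard library as below. This is an illustration under assumptions: the per-run scores are hypothetical, Cohen's d is computed with the pooled standard deviation for equal group sizes (the exact variant used is not stated), and only the t statistic is computed here, since turning it into a P-value needs the t distribution from a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of a dependent (paired) t-test on matched runs."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def cohens_d(a, b):
    """Effect size using the pooled standard deviation (equal group sizes)."""
    pooled_sd = math.sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    return (mean(a) - mean(b)) / pooled_sd


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical balanced accuracies from matched runs of two models.
model_a = [0.70, 0.72, 0.71, 0.73]
model_b = [0.68, 0.69, 0.70, 0.70]
print(paired_t_statistic(model_a, model_b), cohens_d(model_a, model_b))
print(bonferroni_alpha(0.05, 6))
```

A comparison would be called significant only if its P-value falls below the corrected threshold, which for six comparisons is 0.05/6.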

Table ST5. Statistical comparison of models when predicting if patients will see a counsellor in the twelve months following their initial oncologist consultation
P-values from running two-tailed dependent t-tests between ten runs of each model are shaded light grey, in the top-right of each grid. Cohen's d values are shown in the bottom-left of each grid. For the t-tests, using Bonferroni correction, we adjust alpha at 95% confidence to 0.05/6. Values less than 0.000001 are marked 0. AUC: receiver-operator-curve area-under-curve, BAC: balanced accuracy, BERT: bidirectional encoder representations from transformers, CNN: convolutional neural networks, LSTM: long short-term memory.

Table ST6. Statistical comparison of each model when predicting if patients will see a psychiatrist versus seeing a counsellor, in the twelve months following their initial oncologist consultation
P-values from running two-tailed dependent t-tests between ten runs of each model. Using Bonferroni correction, we adjust alpha at 95% confidence to 0.05/4.

Table ST8. Statistical comparison of CNN, BERT and Longformer models when predicting seeing a psychiatrist when using different numbers of tokens and undersampling
P-values from running two-tailed dependent t-tests between ten runs of each model are shaded light grey, in the top-right of each grid. Cohen's d values are shown in the bottom-left of each grid. For the t-tests, using Bonferroni correction, we adjust alpha at 95% confidence to 0.05/16. Values less than 0.000001 are marked 0. AUC: receiver-operator-curve area-under-curve, BAC: balanced accuracy, BERT: bidirectional encoder representations from transformers, CNN: convolutional neural networks, LSTM: long short-term memory. Numbers following the names of models correspond to their maximum number of tokens per document; CNN has no such limit. All results compared in this table come from models trained using undersampling, a technical requirement for running the Longformer models.

Table ST9 : Additional topic representations and example sentences from using our new interpretation technique with models trained to predict seeing a psychiatrist
'right breast is negative .', 'on19eptemberXXth , she underwent a mammogram and a sentinel lymph node biopsy which was converted to an axillary lymph node dissection due to positive nodes clinically on exam .', 'invasive ductal carcinoma of the right breast , with biopsy proven right axillary lymph node involvement , er positive ( intensity 2 3 of tumor cells ) , pr negative , her2 positive ( 3 3 ) .'] [